GEM-PRO - SBML Model¶
This notebook gives an example of how to run the GEM-PRO pipeline with a SBML model, in this case iNJ661, the metabolic model of M. tuberculosis.
Imports¶
In [1]:
import sys
import logging
In [2]:
# Import the GEM-PRO class
from ssbio.pipeline.gempro import GEMPRO
In [3]:
# Printing multiple outputs per cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
Logging¶
Set the logging level in logger.setLevel(logging.<LEVEL_HERE>)
to
specify how verbose you want the pipeline to be. Debug is most verbose.
CRITICAL
- Only really important messages shown
ERROR
- Major errors
WARNING
- Warnings that don’t affect running of the pipeline
INFO
(default)- Info such as the number of structures mapped per gene
DEBUG
- Really detailed information that will print out a lot of stuff
DEBUG
mode prints out a large amount of information,
especially if you have a lot of genes. This may stall your notebook!
In [4]:
# Create logger
logger = logging.getLogger()
logger.setLevel(logging.INFO) # SET YOUR LOGGING LEVEL HERE #
In [5]:
# Other logger stuff for Jupyter notebooks
handler = logging.StreamHandler(sys.stderr)
formatter = logging.Formatter('[%(asctime)s] [%(name)s] %(levelname)s: %(message)s', datefmt="%Y-%m-%d %H:%M")
handler.setFormatter(formatter)
logger.handlers = [handler]
Initialization of the project¶
Set these three things:
ROOT_DIR
- The directory where a folder named after your
PROJECT
will be created
- The directory where a folder named after your
PROJECT
- Your project name
LIST_OF_GENES
- Your list of gene IDs
A directory will be created in ROOT_DIR
with your PROJECT
name.
The folders are organized like so:
ROOT_DIR
└── PROJECT
├── data # General storage for pipeline outputs
├── model # SBML and GEM-PRO models are stored here
├── genes # Per gene information
│ ├── <gene_id1> # Specific gene directory
│ │ └── protein
│ │ ├── sequences # Protein sequence files, alignments, etc.
│ │ └── structures # Protein structure files, calculations, etc.
│ └── <gene_id2>
│ └── protein
│ ├── sequences
│ └── structures
├── reactions # Per reaction information
│ └── <reaction_id1> # Specific reaction directory
│ └── complex
│ └── structures # Protein complex files
└── metabolites # Per metabolite information
└── <metabolite_id1> # Specific metabolite directory
└── chemical
└── structures # Metabolite 2D and 3D structure files
In [6]:
# SET FOLDERS AND DATA HERE
import tempfile
ROOT_DIR = tempfile.gettempdir()
PROJECT = 'mtuberculosis_gp'
GEM_FILE = '../../ssbio/test/test_files/models/iNJ661.json'
GEM_FILE_TYPE = 'json'
PDB_FILE_TYPE = 'mmtf'
In [7]:
# Create the GEM-PRO project
my_gempro = GEMPRO(gem_name=PROJECT, root_dir=ROOT_DIR, gem_file_path=GEM_FILE, gem_file_type=GEM_FILE_TYPE, pdb_file_type=PDB_FILE_TYPE)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: Creating GEM-PRO project directory in folder /tmp
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: /tmp/mtuberculosis_gp: GEM-PRO project location
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: iNJ661: loaded model
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 1025: number of reactions
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 720: number of reactions linked to a gene
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes (excluding spontaneous)
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 826: number of metabolites
[2018-02-05 18:13] [ssbio.pipeline.gempro] WARNING: IMPORTANT: All Gene objects have been transformed into GenePro objects, and will be for any new ones
[2018-02-05 18:13] [ssbio.pipeline.gempro] INFO: 661: number of genes
Mapping gene ID –> sequence¶
First, we need to map these IDs to their protein sequences. There are 2 ID mapping services provided to do this - through KEGG or UniProt. The end goal is to map a UniProt ID to each ID, since there is a comprehensive mapping (and some useful APIs) between UniProt and the PDB.
However, you don’t need to map using these services if you already have
the amino acid sequences for each protein. You can just manually load in
the sequences as shown using the method manual_seq_mapping
. Or, if
you already have the UniProt IDs, you can load those in using the method
manual_uniprot_mapping
.
Methods¶
-
GEMPRO.
manual_seq_mapping
(gene_to_seq_dict, outdir=None, write_fasta_files=True, set_as_representative=True)[source] Read a manual input dictionary of model gene IDs –> protein sequences. By default sets them as representative.
Parameters: - gene_to_seq_dict (dict) – Mapping of gene IDs to their protein sequence strings
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- write_fasta_files (bool) – If individual protein FASTA files should be written out
- set_as_representative (bool) – If mapped sequences should be set as representative
In [8]:
gene_to_seq_dict = {'Rv1295': 'MTVPPTATHQPWPGVIAAYRDRLPVGDDWTPVTLLEGGTPLIAATNLSKQTGCTIHLKVEGLNPTGSFKDRGMTMAVTDALAHGQRAVLCASTGNTSASAAAYAARAGITCAVLIPQGKIAMGKLAQAVMHGAKIIQIDGNFDDCLELARKMAADFPTISLVNSVNPVRIEGQKTAAFEIVDVLGTAPDVHALPVGNAGNITAYWKGYTEYHQLGLIDKLPRMLGTQAAGAAPLVLGEPVSHPETIATAIRIGSPASWTSAVEAQQQSKGRFLAASDEEILAAYHLVARVEGVFVEPASAASIAGLLKAIDDGWVARGSTVVCTVTGNGLKDPDTALKDMPSVSPVPVDPVAVVEKLGLA',
'Rv2233': 'VSSPRERRPASQAPRLSRRPPAHQTSRSSPDTTAPTGSGLSNRFVNDNGIVTDTTASGTNCPPPPRAAARRASSPGESPQLVIFDLDGTLTDSARGIVSSFRHALNHIGAPVPEGDLATHIVGPPMHETLRAMGLGESAEEAIVAYRADYSARGWAMNSLFDGIGPLLADLRTAGVRLAVATSKAEPTARRILRHFGIEQHFEVIAGASTDGSRGSKVDVLAHALAQLRPLPERLVMVGDRSHDVDGAAAHGIDTVVVGWGYGRADFIDKTSTTVVTHAATIDELREALGV'}
my_gempro.manual_seq_mapping(gene_to_seq_dict)
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Loaded in 2 sequences
-
GEMPRO.
manual_uniprot_mapping
(gene_to_uniprot_dict, outdir=None, set_as_representative=True)[source] Read a manual dictionary of model gene IDs –> UniProt IDs. By default sets them as representative.
This allows for mapping of the missing genes, or overriding of automatic mappings.
Input a dictionary of:
{ <gene_id1>: <uniprot_id1>, <gene_id2>: <uniprot_id2>, }
Parameters: - gene_to_uniprot_dict – Dictionary of mappings as shown above
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
In [9]:
manual_uniprot_dict = {'Rv1755c': 'P9WIA9', 'Rv2321c': 'P71891', 'Rv0619': 'Q79FY3', 'Rv0618': 'Q79FY4', 'Rv2322c': 'P71890'}
my_gempro.manual_uniprot_mapping(manual_uniprot_dict)
my_gempro.df_uniprot_metadata.tail(4)
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed manual ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Out[9]:
uniprot | reviewed | gene_name | kegg | refseq | pfam | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||||
Rv0619 | Q79FY3 | False | galTb | NaN | NaN | PF02744 | Probable galactose-1-phosphate uridylyltransfe... | 2017-07-05 | 78 | 2004-07-05 | 1 | Q79FY3.fasta | Q79FY3.xml |
Rv1755c | P9WIA9 | False | plcD | NaN | NaN | PF04185 | Phospholipase C 4 | 2017-07-05 | 18 | 2014-04-16 | 1 | P9WIA9.fasta | P9WIA9.xml |
Rv2321c | P71891 | False | rocD2 | mtv:RVBD_2321c | WP_003411956.1 | PF00202 | Probable ornithine aminotransferase (C-terminu... | 2017-07-05 | 116 | 1997-02-01 | 1 | P71891.fasta | P71891.xml |
Rv2322c | P71890 | False | rocD1 | mtv:RVBD_2322c | WP_003411957.1 | PF00202 | Probable ornithine aminotransferase (N-terminu... | 2017-06-07 | 117 | 1997-02-01 | 1 | P71890.fasta | P71890.xml |
-
GEMPRO.
kegg_mapping_and_metadata
(kegg_organism_code, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source] Map all genes in the model to KEGG IDs using the KEGG service.
- Steps:
- Download all metadata and sequence files in the sequences directory
- Creates a KEGGProp object in the protein.sequences attribute
- Returns a Pandas DataFrame of mapping results
Parameters: - kegg_organism_code (str) – The three letter KEGG code of your organism
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model gene IDs.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped KEGG IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
In [10]:
# KEGG mapping of gene ids
my_gempro.kegg_mapping_and_metadata(kegg_organism_code='mtu')
print('Missing KEGG mapping: ', my_gempro.missing_kegg_mapping)
my_gempro.df_kegg_metadata.head()
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv1755c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2233: no metadata file available
[2018-02-05 18:14] [ssbio.core.protein] WARNING: Rv2233: representative sequence does not match mapped KEGG sequence.
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0619: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv0618: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2321c: no metadata file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no sequence file available
[2018-02-05 18:14] [root] WARNING: status is not ok with Not Found
[2018-02-05 18:14] [ssbio.databases.kegg] WARNING: mtu:Rv2322c: no metadata file available
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: 655/661: number of genes mapped to KEGG
[2018-02-05 18:14] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> KEGG. See the "df_kegg_metadata" attribute for a summary dataframe.
Missing KEGG mapping: ['Rv0618', 'Rv2233', 'Rv0619', 'Rv1755c', 'Rv2322c', 'Rv2321c']
Out[10]:
kegg | refseq | uniprot | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|---|
gene | ||||||
Rv0013 | mtu:Rv0013 | YP_177615 | I6WX77 | NaN | mtu-Rv0013.faa | mtu-Rv0013.kegg |
Rv0032 | mtu:Rv0032 | NP_214546 | I6Y6Q7 | NaN | mtu-Rv0032.faa | mtu-Rv0032.kegg |
Rv0046c | mtu:Rv0046c | NP_214560 | I6X8D3 | 1GR0 | mtu-Rv0046c.faa | mtu-Rv0046c.kegg |
Rv0066c | mtu:Rv0066c | NP_214580 | L0T2B7 | NaN | mtu-Rv0066c.faa | mtu-Rv0066c.kegg |
Rv0069c | mtu:Rv0069c | NP_214583 | A0A089S0Y8 | NaN | mtu-Rv0069c.faa | mtu-Rv0069c.kegg |
-
GEMPRO.
uniprot_mapping_and_metadata
(model_gene_source, custom_gene_mapping=None, outdir=None, set_as_representative=False, force_rerun=False)[source] Map all genes in the model to UniProt IDs using the UniProt mapping service. Also download all metadata and sequences.
Parameters: - model_gene_source (str) –
the database source of your model gene IDs. See: http://www.uniprot.org/help/api_idmapping Common model gene sources are:
- Ensembl Genomes -
ENSEMBLGENOME_ID
(i.e. E. coli b-numbers) - Entrez Gene (GeneID) -
P_ENTREZGENEID
- RefSeq Protein -
P_REFSEQ_AC
- Ensembl Genomes -
- custom_gene_mapping (dict) – If your model genes differ from the gene IDs you want to map, custom_gene_mapping allows you to input a dictionary which maps model gene IDs to new ones. Dictionary keys must match model genes.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- set_as_representative (bool) – If mapped UniProt IDs should be set as representative sequences
- force_rerun (bool) – If you want to overwrite any existing mappings and files
- model_gene_source (str) –
In [11]:
# UniProt mapping
my_gempro.uniprot_mapping_and_metadata(model_gene_source='TUBERCULIST_ID')
print('Missing UniProt mapping: ', my_gempro.missing_uniprot_mapping)
my_gempro.df_uniprot_metadata.head()
[2018-02-05 18:14] [root] INFO: getUserAgent: Begin
[2018-02-05 18:14] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:14] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes mapped to UniProt
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed ID mapping --> UniProt. See the "df_uniprot_metadata" attribute for a summary dataframe.
Missing UniProt mapping: ['Rv2178c', 'Rv2476c', 'Rv3607c', 'Rv1202', 'Rv1030', 'Rv1187', 'Rv2421c', 'Rv2247', 'Rv2701c', 'Rv1662', 'Rv1380', 'Rv2763c', 'Rv2435c', 'Rv0082', 'Rv0189c', 'Rv3757c', 'Rv1133c', 'Rv1257c', 'Rv1623c', 'Rv0958', 'Rv1162', 'Rv3581c', 'Rv1739c', 'Rv1086', 'Rv1077', 'Rv1625c', 'Rv3227', 'Rv2832c', 'Rv3308', 'Rv2858c', 'Rv2193', 'Rv2834c', 'Rv2995c', 'Rv1915', 'Rv3003c', 'Rv1538c', 'Rv2793c', 'Rv2152c', 'Rv3468c', 'Rv1485', 'Rv1079', 'Rv1552', 'Rv1161', 'Rv3307', 'Rv0768', 'Rv1843c', 'Rv1302', 'Rv0780', 'Rv2064', 'Rv2124c', 'Rv1908c', 'Rv1555', 'Rv2754c', 'Rv0346c', 'Rv2467', 'Rv2384', 'Rv3758c', 'Rv1338', 'Rv1311', 'Rv1599', 'Rv1589', 'Rv2930', 'Rv0564c', 'Rv2070c', 'Rv0143c', 'Rv0107c', 'Rv2157c', 'Rv2847c', 'Rv1416', 'Rv0147', 'Rv0432', 'Rv3152', 'Rv0098', 'Rv0548c', 'Rv1408', 'Rv2320c', 'Rv2859c', 'Rv3314c', 'Rv3331', 'Rv3393', 'Rv0509', 'Rv2062c', 'Rv1248c', 'Rv1895', 'Rv2870c', 'Rv1131', 'Rv1294', 'Rv0337c', 'Rv0650', 'Rv1001', 'Rv2211c', 'Rv2471', 'Rv2075c', 'Rv1622c', 'Rv3045', 'Rv0772', 'Rv1607', 'Rv2601', 'Rv2363', 'Rv2483c', 'Rv1553', 'Rv1559', 'Rv3464', 'Rv3010c', 'Rv2552c', 'Rv1031', 'Rv1905c', 'Rv3815c', 'Rv2935', 'Rv3469c', 'Rv1595', 'Rv1293', 'Rv1099c', 'Rv3398c', 'Rv2764c', 'Rv1604', 'Rv2399c', 'Rv1695', 'Rv0253', 'Rv0046c', 'Rv3634c', 'Rv2982c', 'Rv0414c', 'Rv3410c', 'Rv3042c', 'Rv1659', 'Rv0013', 'Rv3582c', 'Rv2051c', 'Rv1082', 'Rv1373', 'Rv1570', 'Rv2236c', 'Rv1093', 'Rv1445c', 'Rv3602c', 'Rv2502c', 'Rv1436', 'Rv1849', 'Rv2382c', 'Rv0952', 'Rv2139', 'Rv0956', 'Rv3283', 'Rv3310', 'Rv0373c', 'Rv2243', 'Rv1612', 'Rv1307', 'Rv3229c', 'Rv3265c', 'Rv2220', 'Rv2786c', 'Rv0382c', 'Rv2583c', 'Rv2678c', 'Rv0896', 'Rv1207', 'Rv0183', 'Rv3913', 'Rv2940c', 'Rv3153', 'Rv1731', 'Rv0973c', 'Rv3247c', 'Rv0728c', 'Rv2987c', 'Rv3846', 'Rv3150', 'Rv0234c', 'Rv3794', 'Rv3436c', 'Rv2427c', 'Rv3257c', 'Rv2539c', 'Rv2965c', 'Rv2156c', 'Rv3791', 'Rv1296', 'Rv2287', 'Rv2899c', 'Rv3275c', 'Rv1837c', 'Rv1285', 'Rv1602', 'Rv2245', 'Rv0423c', 'Rv2208', 'Rv3801c', 'Rv0436c', 'Rv1492', 'Rv2780', 'Rv1658', 'Rv2674', 'Rv2329c', 'Rv3509c', 'Rv3290c', 'Rv0620', 'Rv1383', 'Rv1850', 'Rv0112', 'Rv2996c', 'Rv3396c', 'Rv2671', 'Rv1653', 'Rv2496c', 'Rv3330', 'Rv1663', 'Rv0820', 'Rv1448c', 'Rv2205c', 'Rv1447c', 'Rv2043c', 'Rv3806c', 'Rv2029c', 'Rv3792', 'Rv3313c', 'Rv1822', 'Rv3470c', 'Rv1609', 'Rv0103c', 'Rv3264c', 'Rv1484', 'Rv3534c', 'Rv1603', 'Rv0803', 'Rv2849c', 'Rv2458', 'Rv3341', 'Rv1617', 'Rv0570', 'Rv3609c', 'Rv1512', 'Rv1672c', 'Rv3759c', 'Rv2201', 'Rv0855', 'Rv3236c', 'Rv1940', 'Rv2163c', 'Rv0374c', 'Rv1563c', 'Rv0375c', 'Rv2455c', 'Rv1310', 'Rv1295', 'Rv2702', 'Rv3356c', 'Rv2445c', 'Rv3273', 'Rv1306', 'Rv3423c', 'Rv2589', 'Rv0317c', 'Rv3303c', 'Rv2439c', 'Rv1304', 'Rv3248c', 'Rv0858c', 'Rv0534c', 'Rv3158', 'Rv1605', 'Rv1127c', 'Rv1692', 'Rv1655', 'Rv2192c', 'Rv1391', 'Rv1122', 'Rv2454c', 'Rv2573', 'Rv2949c', 'Rv2379c', 'Rv1240', 'Rv1098c', 'Rv2196', 'Rv0555', 'Rv1350', 'Rv3302c', 'Rv1309', 'Rv0267', 'Rv0255c', 'Rv2194', 'Rv3051c', 'Rv0524', 'Rv2590', 'Rv2465c', 'Rv0137c', 'Rv1601', 'Rv0437c', 'Rv1554', 'Rv1318c', 'Rv3340', 'Rv1389', 'Rv2934', 'Rv0545c', 'Rv0558', 'Rv1005c', 'Rv1412', 'Rv2127', 'Rv2335', 'Rv2392', 'Rv2981c', 'Rv3709c', 'Rv3754', 'Rv2612c', 'Rv2931', 'Rv0557', 'Rv0252', 'Rv2928', 'Rv1347c', 'Rv1449c', 'Rv1600', 'Rv2207', 'Rv0536', 'Rv0295c', 'Rv3455c', 'Rv3411c', 'Rv0503c', 'Rv3708c', 'Rv0417', 'Rv1704c', 'Rv0489', 'Rv3795', 'Rv0553', 'Rv0573c', 'Rv2498c', 'Rv0470c', 'Rv0773c', 'Rv3048c', 'Rv2933', 'Rv3704c', 'Rv3818', 'Rv1737c', 'Rv0266c', 'Rv0644c', 'Rv2130c', 'Rv3280', 'Rv2400c', 'Rv2182c', 'Rv1018c', 'Rv2158c', 'Rv3149', 'Rv2289', 'Rv0126', 'Rv2967c', 'Rv3214', 'Rv2317', 'Rv3490', 'Rv2881c', 'Rv1916', 'Rv1438', 'Rv1348', 'Rv3808c', 'Rv0334', 'Rv1594', 'Rv3156', 'Rv2072c', 'Rv2344c', 'Rv2210c', 'Rv2281', 'Rv0993', 'Rv2316', 'Rv1406', 'Rv1613', 'Rv1201c', 'Rv0091', 'Rv3318', 'Rv3285', 'Rv0409', 'Rv0974c', 'Rv3316', 'Rv2121c', 'Rv2833c', 'Rv3309c', 'Rv1328', 'Rv3266c', 'Rv1213', 'Rv2697c', 'Rv0946c', 'Rv1832', 'Rv3793', 'Rv1011', 'Rv3696c', 'Rv2246', 'Rv3858c', 'Rv1620c', 'Rv2386c', 'Rv1618', 'Rv0211', 'Rv1631', 'Rv0512', 'Rv1385', 'Rv1286', 'Rv2122c', 'Rv3319', 'Rv2249c', 'Rv2835c', 'Rv2941', 'Rv2605c', 'Rv2394', 'Rv3145', 'Rv0886', 'Rv3157', 'Rv3737', 'Rv1094', 'Rv1820', 'Rv0500', 'Rv1236', 'Rv1409', 'Rv2136c', 'Rv3800c', 'Rv2984', 'Rv0118c', 'Rv1121', 'Rv2932', 'Rv1826', 'Rv2233', 'Rv1511', 'Rv2540c', 'Rv3838c', 'Rv1305', 'Rv2200c', 'Rv3146', 'Rv0389', 'Rv0848', 'Rv3279c', 'Rv1529', 'Rv0533c', 'Rv0951', 'Rv0884c', 'Rv1308', 'Rv0482', 'Rv2259', 'Rv0727c', 'Rv0261c', 'Rv3281', 'Rv3756c', 'Rv1475c', 'Rv0462', 'Rv1551', 'Rv0805', 'Rv1656', 'Rv2398c', 'Rv1392', 'Rv1170', 'Rv2438c', 'Rv3379c', 'Rv1562c', 'Rv0066c', 'Rv2726c', 'Rv1844c', 'Rv1699', 'Rv2231c', 'Rv3002c', 'Rv3441c', 'Rv1029', 'Rv3001c', 'Rv1415', 'Rv3148', 'Rv2378c', 'Rv2958c', 'Rv2900c', 'Rv0505c', 'Rv3043c', 'Rv1200', 'Rv1185c', 'Rv2531c', 'Rv3772', 'Rv2947c', 'Rv1654', 'Rv0486', 'Rv2436', 'Rv2977c', 'Rv0511', 'Rv1848', 'Rv1237', 'Rv3601c', 'Rv0733', 'Rv3155', 'Rv0753c', 'Rv0694', 'Rv3215', 'Rv3710', 'Rv3606c', 'Rv0254c', 'Rv3315c', 'Rv3713', 'Rv0155', 'Rv0859', 'Rv3317', 'Rv2222c', 'Rv2495c', 'Rv1745c', 'Rv2497c', 'Rv0771', 'Rv1320c', 'Rv3842c', 'Rv1349', 'Rv2773c', 'Rv1164', 'Rv1381', 'Rv0363c', 'Rv2682c', 'Rv3113', 'Rv3154', 'Rv2860c', 'Rv3588c', 'Rv3608c', 'Rv3535c', 'Rv2318', 'Rv0777', 'Rv0729', 'Rv1714', 'Rv0645c', 'Rv0853c', 'Rv0642c', 'Rv0162c', 'Rv2992c', 'Rv3068c', 'Rv1621c', 'Rv0522', 'Rv1568', 'Rv3465', 'Rv2584c', 'Rv2746c', 'Rv0248c', 'Rv0542c', 'Rv2383c', 'Rv0391', 'Rv2332', 'Rv1264', 'Rv2065', 'Rv1652', 'Rv2291', 'Rv1902c', 'Rv1928c', 'Rv0478', 'Rv0467', 'Rv2713', 'Rv1017c', 'Rv0501', 'Rv0422c', 'Rv1023', 'Rv3624c', 'Rv2071c', 'Rv2964', 'Rv0357c', 'Rv2195', 'Rv1647', 'Rv3293', 'Rv2006', 'Rv3826', 'Rv0499', 'Rv0032', 'Rv2988c', 'Rv2155c', 'Rv3372', 'Rv1712', 'Rv1336', 'Rv0306', 'Rv2397c', 'Rv1611', 'Rv3859c', 'Rv0889c', 'Rv2677c', 'Rv0069c', 'Rv2945c', 'Rv1238', 'Rv0408', 'Rv3276c', 'Rv2361c', 'Rv0510', 'Rv1569', 'Rv2610c', 'Rv0812', 'Rv2962c', 'Rv0649', 'Rv2381c', 'Rv2538c', 'Rv1872c', 'Rv3628', 'Rv0191', 'Rv2920c', 'Rv3777', 'Rv0808', 'Rv2447c', 'Rv2607', 'Rv2611c', 'Rv3790', 'Rv3147', 'Rv2380c', 'Rv2753c', 'Rv1323', 'Rv3255c', 'Rv2202c', 'Rv2443', 'Rv0247c', 'Rv1437', 'Rv3784', 'Rv0156', 'Rv1188', 'Rv0322', 'Rv2957', 'Rv0957', 'Rv3332', 'Rv0157', 'Rv3339c', 'Rv3809c', 'Rv2523c', 'Rv0468', 'Rv0936', 'Rv2501c', 'Rv3106', 'Rv0904c', 'Rv1878', 'Rv1451', 'Rv0794c', 'Rv1885c', 'Rv2524c', 'Rv2153c', 'Rv1606', 'Rv3432c', 'Rv0843', 'Rv0321', 'Rv2883c', 'Rv2334', 'Rv1315', 'Rv0819', 'Rv3667', 'Rv0809', 'Rv1596', 'Rv0084', 'Rv2066', 'Rv2391', 'Rv1319c', 'Rv1163', 'Rv2504c', 'Rv1239c', 'Rv3565', 'Rv2215', 'Rv2855', 'Rv0824c', 'Rv2537c', 'Rv1464', 'Rv2848c', 'Rv2225', 'Rv0070c', 'Rv2241', 'Rv1493', 'Rv1981c', 'Rv2503c', 'Rv2388c', 'Rv1092c', 'Rv1483', 'Rv0788', 'Rv0860']
Out[11]:
uniprot | reviewed | gene_name | kegg | refseq | pfam | description | entry_date | entry_version | seq_date | seq_version | sequence_file | metadata_file | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||||
Rv0618 | Q79FY4 | False | galTa | mtv:RVBD_0618 | WP_003900189.1 | PF01087 | Probable galactose-1-phosphate uridylyltransfe... | 2017-07-05 | 87 | 2004-07-05 | 1 | Q79FY4.fasta | Q79FY4.xml |
Rv0619 | Q79FY3 | False | galTb | NaN | NaN | PF02744 | Probable galactose-1-phosphate uridylyltransfe... | 2017-07-05 | 78 | 2004-07-05 | 1 | Q79FY3.fasta | Q79FY3.xml |
Rv1755c | P9WIA9 | False | plcD | NaN | NaN | PF04185 | Phospholipase C 4 | 2017-07-05 | 18 | 2014-04-16 | 1 | P9WIA9.fasta | P9WIA9.xml |
Rv2321c | P71891 | False | rocD2 | mtv:RVBD_2321c | WP_003411956.1 | PF00202 | Probable ornithine aminotransferase (C-terminu... | 2017-07-05 | 116 | 1997-02-01 | 1 | P71891.fasta | P71891.xml |
Rv2322c | P71890 | False | rocD1 | mtv:RVBD_2322c | WP_003411957.1 | PF00202 | Probable ornithine aminotransferase (N-terminu... | 2017-06-07 | 117 | 1997-02-01 | 1 | P71890.fasta | P71890.xml |
-
GEMPRO.
set_representative_sequence
(force_rerun=False)[source] Automatically consolidate loaded sequences (manual, UniProt, or KEGG) and set a single representative sequence.
Manually set representative sequences override all existing mappings. UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
Parameters: force_rerun (bool) – Set to True to recheck stored sequences
If you have mapped with both KEGG and UniProt mappers, then you can set a representative sequence for the gene using this function. If you used just one, this will just set that ID as representative.
- If any sequences or IDs were provided manually, these will be set as representative first.
- UniProt mappings override KEGG mappings except when KEGG mappings have PDBs associated with them and UniProt doesn’t.
In [12]:
# Set representative sequences
my_gempro.set_representative_sequence()
print('Missing a representative sequence: ', my_gempro.missing_representative_sequence)
my_gempro.df_representative_sequences.head()
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 661/661: number of genes with a representative sequence
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: See the "df_representative_sequences" attribute for a summary dataframe.
Missing a representative sequence: []
Out[12]:
uniprot | kegg | pdbs | sequence_file | metadata_file | |
---|---|---|---|---|---|
gene | |||||
Rv0013 | I6WX77 | mtu:Rv0013 | NaN | mtu-Rv0013.faa | mtu-Rv0013.kegg |
Rv0032 | I6Y6Q7 | mtu:Rv0032 | NaN | mtu-Rv0032.faa | mtu-Rv0032.kegg |
Rv0046c | I6X8D3 | mtu:Rv0046c | 1GR0 | mtu-Rv0046c.faa | mtu-Rv0046c.kegg |
Rv0066c | L0T2B7 | mtu:Rv0066c | NaN | mtu-Rv0066c.faa | mtu-Rv0066c.kegg |
Rv0069c | A0A089S0Y8 | mtu:Rv0069c | NaN | mtu-Rv0069c.faa | mtu-Rv0069c.kegg |
Mapping representative sequence –> structure¶
These are the ways to map sequence to structure:
- Use the UniProt ID and their automatic mappings to the PDB
- BLAST the sequence to the PDB
- Make homology models or
- Map to existing homology models
You can only utilize option #1 to map to PDBs if there is a mapped UniProt ID set in the representative sequence. If not, you’ll have to BLAST your sequence to the PDB or make a homology model. You can also run both for maximum coverage.
Methods¶
-
GEMPRO.
map_uniprot_to_pdb
(seq_ident_cutoff=0.0, outdir=None, force_rerun=False)[source] Map all representative sequences’ UniProt ID to PDB IDs using the PDBe “Best Structures” API. Will save a JSON file of the results to each protein’s
sequences
folder.The “Best structures” API is available at https://www.ebi.ac.uk/pdbe/api/doc/sifts.html The list of PDB structures mapping to a UniProt accession sorted by coverage of the protein and, if the same, resolution.
Parameters: - seq_ident_cutoff (float) – Sequence identity cutoff in decimal form
- outdir (str) – Output directory to cache JSON results of search
- force_rerun (bool) – Force re-downloading of JSON results if they already exist
Returns: A rank-ordered list of PDBProp objects that map to the UniProt ID
Return type: list
In [13]:
# Mapping using the PDBe best_structures service
my_gempro.map_uniprot_to_pdb(seq_ident_cutoff=.3)
my_gempro.df_pdb_ranking.head()
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Mapping UniProt IDs --> PDB IDs...
[2018-02-05 18:15] [root] INFO: getUserAgent: Begin
[2018-02-05 18:15] [root] INFO: getUserAgent: user_agent: EBI-Sample-Client/ (services.py; Python 3.6.3; Linux) Python-requests/2.18.4
[2018-02-05 18:15] [root] INFO: getUserAgent: End
[2018-02-05 18:15] [root] WARNING: status is not ok with Bad Request
[2018-02-05 18:15] [root] WARNING: Results seems empty...returning empty dictionary.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 0/661: number of genes with at least one experimental structure
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed UniProt --> best PDB mapping. See the "df_pdb_ranking" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] WARNING: Empty dataframe
Out[13]:
-
GEMPRO.
blast_seqs_to_pdb
(seq_ident_cutoff=0, evalue=0.0001, all_genes=False, display_link=False, outdir=None, force_rerun=False)[source] BLAST each representative protein sequence to the PDB. Saves raw BLAST results (XML files).
Parameters: - seq_ident_cutoff (float, optional) – Cutoff results based on percent coverage (in decimal form)
- evalue (float, optional) – Cutoff for the E-value - filters for significant hits. 0.001 is liberal, 0.0001 is stringent (default).
- all_genes (bool) – If all genes should be BLASTed, or only those without any structures currently mapped
- display_link (bool, optional) – Set to True if links to the HTML results should be displayed
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool, optional) – If existing BLAST results should not be used, set to True. Default is False
In [14]:
# Mapping using BLAST
my_gempro.blast_seqs_to_pdb(all_genes=True, seq_ident_cutoff=.9, evalue=0.00001)
my_gempro.df_pdb_blast.head(2)
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: Completed sequence --> PDB BLAST. See the "df_pdb_blast" attribute for a summary dataframe.
[2018-02-05 18:15] [ssbio.pipeline.gempro] INFO: 141: number of genes with additional structures added from BLAST
Out[14]:
pdb_id | pdb_chain_id | hit_score | hit_evalue | hit_percent_similar | hit_percent_ident | hit_num_ident | hit_num_similar | |
---|---|---|---|---|---|---|---|---|
gene | ||||||||
Rv0046c | 1gr0 | A | 1861.0 | 0.0 | 1.000000 | 1.000000 | 367 | 367 |
Rv0066c | 5kvu | D | 3828.0 | 0.0 | 0.981208 | 0.981208 | 731 | 731 |
-
GEMPRO.
get_itasser_models
(homology_raw_dir, custom_itasser_name_mapping=None, outdir=None, force_rerun=False)[source] Copy generated I-TASSER models from a directory to the GEM-PRO directory.
Parameters: - homology_raw_dir (str) – Root directory of I-TASSER folders.
- custom_itasser_name_mapping (dict) – Use this if your I-TASSER folder names differ from your model gene names. Input a dict of {model_gene: ITASSER_folder}.
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
In [15]:
tb_homology_dir = '/home/nathan/projects_archive/homology_models/MTUBERCULOSIS/'
##### EXAMPLE SPECIFIC CODE #####
# Needed to map to older IDs used in this example
import pandas as pd
import os.path as op
old_gene_to_homology = pd.read_csv(op.join(tb_homology_dir, 'data/161031-old_gene_to_uniprot_mapping.csv'))
gene_to_uniprot = old_gene_to_homology.set_index('m_gene').to_dict()['u_uniprot_acc']
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'), custom_itasser_name_mapping=gene_to_uniprot)
### END EXAMPLE SPECIFIC CODE ###
# Organizing I-TASSER homology models
my_gempro.get_itasser_models(homology_raw_dir=op.join(tb_homology_dir, 'raw'))
my_gempro.df_homology_models.head()
[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 435 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.
[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Completed copying of 9 I-TASSER models to GEM-PRO directory. See the "df_homology_models" attribute for a summary dataframe.
Out[15]:
id | structure_file | model_date | difficulty | top_template_pdb | top_template_chain | c_score | tm_score | tm_score_err | rmsd | rmsd_err | |
---|---|---|---|---|---|---|---|---|---|---|---|
gene | |||||||||||
Rv0013 | P9WN35 | P9WN35_model1.pdb | 2018-02-06 | easy | 1i7s | B | -0.53 | 0.65 | 0.13 | 6.8 | 4.0 |
Rv0032 | P9WQ85 | P9WQ85_model1.pdb | 2018-02-06 | easy | 3a2b | A | -2.89 | 0.39 | 0.13 | 15.7 | 3.3 |
Rv0066c | O53611 | O53611_model1.pdb | 2018-02-06 | easy | 1itw | A | 1.91 | 0.99 | 0.04 | 4.1 | 2.8 |
Rv0069c | P9WGT5 | P9WGT5_model1.pdb | 2018-02-06 | easy | 4rqo | A | 1.18 | 0.88 | 0.07 | 4.6 | 3.0 |
Rv0070c | P9WGI7 | P9WGI7_model1.pdb | 2018-02-06 | easy | 3h7f | B | 1.80 | 0.97 | 0.05 | 3.3 | 2.3 |
-
GEMPRO.
get_manual_homology_models
(input_dict, outdir=None, clean=True, force_rerun=False)[source] Copy homology models to the GEM-PRO project.
Requires an input of a dictionary formatted like so:
{ model_gene: { homology_model_id1: { 'model_file': '/path/to/homology/model.pdb', 'file_type': 'pdb' 'additional_info': info_value }, homology_model_id2: { 'model_file': '/path/to/homology/model.pdb' 'file_type': 'pdb' } } }
Parameters: - input_dict (dict) – Dictionary of dictionaries of gene names to homology model IDs and other information
- outdir (str) – Path to output directory of downloaded files, must be set if GEM-PRO directories were not created initially
- clean (bool) – If homology files should be cleaned and saved as a new PDB file
- force_rerun (bool) – If homology files should be copied again even if they exist in the GEM-PRO directory
In [16]:
homology_model_dict = {}
my_gempro.get_manual_homology_models(homology_model_dict)
[2018-02-05 18:16] [ssbio.pipeline.gempro] INFO: Updated homology model information for 0 genes.
Downloading and ranking structures¶
Methods¶
-
GEMPRO.
pdb_downloader_and_metadata
(outdir=None, pdb_file_type=None, force_rerun=False)[source] Download ALL mapped experimental structures to each protein’s structures directory.
Parameters: - outdir (str) – Path to output directory, if GEM-PRO directories were not set or other output directory is desired
- pdb_file_type (str) – Type of PDB file to download, if not already set or other format is desired
- force_rerun (bool) – If files should be re-downloaded if they already exist
In [ ]:
# Download all mapped PDBs and gather the metadata
my_gempro.pdb_downloader_and_metadata()
my_gempro.df_pdb_metadata.head(2)
-
GEMPRO.
set_representative_structure
(seq_outdir=None, struct_outdir=None, pdb_file_type=None, engine=’needle’, always_use_homology=False, rez_cutoff=0.0, seq_ident_cutoff=0.5, allow_missing_on_termini=0.2, allow_mutants=True, allow_deletions=False, allow_insertions=False, allow_unresolved=True, skip_large_structures=False, clean=True, force_rerun=False)[source] Set all representative structure for proteins from a structure in the structures attribute.
Each gene can have a combination of the following, which will be analyzed to set a representative structure.
- Homology model(s)
- Ranked PDBs
- BLASTed PDBs
If the
always_use_homology
flag is true, homology models are always set as representative when they exist. If there are multiple homology models, we rank by the percent sequence coverage.Parameters: - seq_outdir (str) – Path to output directory of sequence alignment files, must be set if GEM-PRO directories were not created initially
- struct_outdir (str) – Path to output directory of structure files, must be set if GEM-PRO directories were not created initially
- pdb_file_type (str) –
pdb
,mmCif
,xml
,mmtf
- file type for files downloaded from the PDB - engine (str) –
biopython
orneedle
- which pairwise alignment program to use.needle
is the standard EMBOSS tool to run pairwise alignments.biopython
is Biopython’s implementation of needle. Results can differ! - always_use_homology (bool) – If homology models should always be set as the representative structure
- rez_cutoff (float) – Resolution cutoff, in Angstroms (only if experimental structure)
- seq_ident_cutoff (float) – Percent sequence identity cutoff, in decimal form
- allow_missing_on_termini (float) – Percentage of the total length of the reference sequence which will be ignored when checking for modifications. Example: if 0.1, and reference sequence is 100 AA, then only residues 5 to 95 will be checked for modifications.
- allow_mutants (bool) – If mutations should be allowed or checked for
- allow_deletions (bool) – If deletions should be allowed or checked for
- allow_insertions (bool) – If insertions should be allowed or checked for
- allow_unresolved (bool) – If unresolved residues should be allowed or checked for
- skip_large_structures (bool) – Default False – currently, large structures can’t be saved as a PDB file even if you just want to save a single chain, so Biopython will throw an error when trying to do so. As an alternative, if a large structure is selected as representative, the pipeline will currently point to it and not clean it. If you don’t want this to happen, set this to true.
- clean (bool) – If structures should be cleaned
- force_rerun (bool) – If sequence to structure alignment should be rerun
Todo
- Remedy large structure representative setting
In [17]:
# Set representative structures
my_gempro.set_representative_structure()
my_gempro.df_representative_structures.head()
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0234c: no structures meet quality checks
[2018-02-05 18:16] [ssbio.core.protein] WARNING: Rv0505c: no structures meet quality checks
[2018-02-05 18:17] [ssbio.core.protein] WARNING: Rv2987c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv2498c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.core.protein] WARNING: Rv3601c: no structures meet quality checks
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: 553/661: number of genes with a representative structure
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: See the "df_representative_structures" attribute for a summary dataframe.
Out[17]:
id | is_experimental | file_type | structure_file | |
---|---|---|---|---|
gene | ||||
Rv0013 | REP-P9WN35 | False | pdb | P9WN35_model1-X_clean.pdb |
Rv0032 | REP-P9WQ85 | False | pdb | P9WQ85_model1-X_clean.pdb |
Rv0046c | REP-1gr0 | True | pdb | 1gr0-A_clean.pdb |
Rv0066c | REP-5kvu | True | pdb | 5kvu-A_clean.pdb |
Rv0069c | REP-P9WGT5 | False | pdb | P9WGT5_model1-X_clean.pdb |
In [18]:
# Looking at the information saved within a gene
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure
my_gempro.genes.get_by_id('Rv1295').protein.representative_structure.get_dict()
Out[18]:
<StructProp REP-2d1f at 0x7fdec937e4a8>
Out[18]:
{'_structure_dir': '/tmp/mtuberculosis_gp/genes/Rv1295/Rv1295_protein/structures',
'chains': [<ChainProp A at 0x7fdec8655630>],
'date': None,
'description': 'Threonine synthase (E.C.4.2.3.1)',
'file_type': 'pdb',
'id': 'REP-2d1f',
'is_experimental': True,
'mapped_chains': ['A'],
'notes': {},
'original_structure_id': '2d1f',
'resolution': 2.5,
'structure_file': '2d1f-A_clean.pdb',
'taxonomy_name': 'Mycobacterium tuberculosis'}
Creating homology models¶
For those proteins with no representative structure, we can create
homology models for them. ssbio
contains some built in functions for
easily running
I-TASSER
locally or on machines with SLURM
(ie. on NERSC) or Torque
job
scheduling.
You can load in I-TASSER models once they complete using the
get_itasser_models
later.
Methods¶
In [19]:
# Prep I-TASSER model folders
my_gempro.prep_itasser_modeling('~/software/I-TASSER4.4', '~/software/ITLIB/', runtype='local', all_genes=False)
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2934: I-TASSER modeling will not run as sequence length (1827) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2932: I-TASSER modeling will not run as sequence length (1538) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2933: I-TASSER modeling will not run as sequence length (2188) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2931: I-TASSER modeling will not run as sequence length (1876) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2380c: I-TASSER modeling will not run as sequence length (1682) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3859c: I-TASSER modeling will not run as sequence length (1527) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2476c: I-TASSER modeling will not run as sequence length (1624) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv3800c: I-TASSER modeling will not run as sequence length (1733) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv0107c: I-TASSER modeling will not run as sequence length (1632) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2940c: I-TASSER modeling will not run as sequence length (2111) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv1662: I-TASSER modeling will not run as sequence length (1602) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.protein.structure.homology.itasser.itasserprep] WARNING: Rv2524c: I-TASSER modeling will not run as sequence length (3069) is not in the range [10, 1500]
[2018-02-05 18:18] [ssbio.pipeline.gempro] INFO: Prepared I-TASSER modeling folders for 108 genes in folder /tmp/mtuberculosis_gp/data/homology_models
Saving your GEM-PRO¶
Finally, you can save your GEM-PRO as a JSON
or pickle
file, so
you don’t have to run the pipeline again.
For most functions, if you rerun them, they will check for existing results saved as files. The only function that would take a long time is setting the representative structure, as they are each rechecked and cleaned. This is where saving helps!
-
GEMPRO.
save_pickle
(outfile, protocol=2) Save the object as a pickle file
Parameters: - outfile (str) – Filename
- protocol (int) – Pickle protocol to use. Default is 2 to remain compatible with Python 2
Returns: Path to pickle file
Return type: str
In [20]:
import os.path as op
my_gempro.save_pickle(op.join(my_gempro.model_dir, '{}.pckl'.format(my_gempro.id)))
-
GEMPRO.
save_json
(outfile, compression=False) Save the object as a JSON file using json_tricks
In [21]:
import os.path as op
my_gempro.save_json(op.join(my_gempro.model_dir, '{}.json'.format(my_gempro.id)), compression=False)
[2018-02-05 18:18] [root] WARNING: json-tricks: numpy scalar serialization is experimental and may work differently in future versions
[2018-02-05 18:18] [ssbio.io] INFO: Saved <class 'ssbio.pipeline.gempro.GEMPRO'> (id: mtuberculosis_gp) to /tmp/mtuberculosis_gp/model/mtuberculosis_gp.json
Loading a saved GEM-PRO¶
In [ ]:
# Loading a pickle file
import pickle
with open('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.pckl', 'rb') as f:
my_saved_gempro = pickle.load(f)
In [ ]:
# Loading a JSON file
import ssbio.core.io
my_saved_gempro = ssbio.core.io.load_json('/tmp/mtuberculosis_gp_atlas/model/mtuberculosis_gp_atlas.json', decompression=False)